arXiv:1507.00805v2 [cs.DS] 6 Jul 2015


Online Self-Indexed Grammar Compression* 


Yoshimasa Takabatake^1, Yasuo Tabei^2, and Hiroshi Sakamoto^1

^1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp
^2 PRESTO, Japan Science and Technology Agency tabei.y.aa@m.titech.ac.jp


Abstract. Although several grammar-based self-indexes have been proposed
thus far, their applicability is limited to offline settings where whole
input texts are prepared in advance, so the index structures must be rebuilt
whenever additional inputs are given, which is often the case in the big
data era. In this paper, we present the first online self-indexed grammar
compression, named OESP-index, which gradually builds the index structure
by reading input characters one by one. This property also saves working
space during construction, because we do not need to store input texts in
memory. We experimentally test OESP-index on the ability to build index
structures and search query texts, and we show OESP-index's efficiency,
especially space-efficiency for building index structures.


1 Introduction 


Text collections including many repetitions, so-called highly repetitive
texts, have become common. Version-controlled software stores a large amount
of documents with small differences. The current sequencing technology
enables us to read individual genomes quickly and economically, which
generates large databases of thousands of human genomes [3]. The genetic
difference between individual human genomes is said to be approximately 0.1
percent, thus making the collection highly repetitive. There is therefore a
strong need for developing powerful methods to store and process repetitive
text collections on a large scale.

Self-indexes aim at representing a collection of texts in a compressed format
that supports random access to any position and also provides query searches
on the collection. Although grammar-based self-indexes are especially
effective for processing highly repetitive texts and several grammar-based
self-indexes have been proposed [5,6,1,2,15] (see Table 1), their
applicability is limited to offline cases where all the text collections are
given in advance, so the indexes must be rebuilt when additional texts are
given. Even worse, they need to store whole input texts in memory for
constructing indexes, which requires a large amount of working space. The
problem is especially serious when we process massive collections of highly
repetitive texts, which are ubiquitous in the big data era. An open challenge
is to develop an online self-indexed grammar compression that not only uses a
small working space for a large input but also can update the data structures
of the self-index when new additional texts arrive.


* This work was supported by JSPS KAKENHI (24700140, 26280088) and the JST
PRESTO program.




Table 1. Comparison with offline methods. Construction time, search time and
extraction time are presented with the big O omitted because of space
limitations. $N$ is the length of text, $m$ is the length of query pattern,
$n$ is the number of variables in a grammar, $\sigma$ is alphabet size, $h$
is the height of the parse tree of the straight line program, $z$ is the
number of phrases in LZ77, $d$ is the length of nesting in LZ77, $occ$ is the
number of occurrences of query pattern in a text, $occ_q$ is the number of
candidate appearances of query patterns, $\lg^*$ is the iterated logarithm,
and $\alpha \in (0,1]$ is a load factor for a hash table. $\lg$ stands for
$\log_2$.

Edit-sensitive parsing (ESP) [4] is an efficient parsing algorithm originally
developed for approximately computing edit distances with moves between texts.
ESP builds from a given text a parse tree that guarantees upper bounds of
parsing discrepancies between different appearances of the same subtext. Maruyama
et al. m presented a grammar-based self index called ESP-index on the notion 
of ESP and Takabatake et al. m improved ESP-index for fast query searches 
by using GMR's rank/select operations for general alphabet [7]. Unlike other
grammar-based self-indexes, they perform top-down searches for finding
candidate appearances of a query text on the data structure by leveraging the
upper bounds of parsing discrepancies in ESP. However, their applicability is
limited to offline cases.

In this paper, we present an online self-indexed grammar compression named
OESP-index for building a self-index by reading input characters one by one.
As far as we know, OESP-index is the first method for building grammar-based
self-indexes in an online manner. OESP-index is built on the notion of ESP and
its data structures are constructed by leveraging the idea behind fully-online
LCA (FOLCA) [13,12], an efficient online grammar compression that builds a
context-free grammar (CFG) from an input text and encodes it into a succinct
representation. We present novel query search and random access algorithms
for OESP-index and discuss their efficiency.

Experiments were performed on retrieving query texts from a benchmark 
collection of highly repetitive texts. The performance comparison with other 
algorithms demonstrates OESP-index’s superiority. 

2 Preliminaries 

2.1 Basic notations 

Let $\Sigma$ be a finite alphabet and $\sigma = |\Sigma|$. The length of
string $S$ is denoted by $|S|$. The set of all strings over $\Sigma$ is
denoted by $\Sigma^*$. The set of all strings of length $k$ is denoted by
$\Sigma^k$. We assume a recursively enumerable set $X$ of variables with
$\Sigma \cap X = \emptyset$. $S[i]$ and $S[i,j]$ denote the $i$-th symbol of
string $S$ and the substring from $S[i]$ to $S[j]$, respectively. $\lg$
stands for $\log_2$. Let $\lg^{(1)} u = \lg u$,
$\lg^{(i+1)} u = \lg\lg^{(i)} u$, and
$\lg^* u = \min\{i \mid \lg^{(i)} u \le 1\}$. Practically $\lg^* u = O(1)$
since $\lg^* u \le 5$ for $u \le 2^{65536}$.
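
As an illustration of the iterated logarithm, the following Python sketch counts how often $\lg$ must be applied before the value drops to at most 1. The function name `lg_star` is hypothetical, and the stopping test `<= 1` is an assumption about the comparison used in the definition.

```python
import math

def lg_star(u):
    """Iterated logarithm: smallest i >= 1 such that lg applied i times to u is <= 1."""
    if u < 1:
        raise ValueError("u must be at least 1")
    i = 1
    value = math.log2(u)
    while value > 1:
        value = math.log2(value)
        i += 1
    return i

for u in (2, 16, 65536, 2 ** 65536):
    print(lg_star(u))   # prints 1, 3, 4, 5 on successive lines
```

Even for an input as large as $2^{65536}$ the value is only 5, which is why the factor is treated as a constant in practice.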


2.2 Straight-line program (SLP) 

A context-free grammar (CFG) in Chomsky normal form is a quadruple
$G = (\Sigma, V, D, X_s)$ where $V$ is a finite subset of $X$, $D$ is a
finite subset of $V \times (V \cup \Sigma)^2$, and $X_s \in V$ is the start
symbol. An element in $D$ is called a production rule. A variable in $V$ is
called a nonterminal symbol. $val(X_i)$ denotes the string derived from
$X_i \in V$. For $X_1, X_2, \ldots, X_k \in V$, let
$val(X_1, X_2, \ldots, X_k) = val(X_1)val(X_2) \cdots val(X_k)$. A grammar
compression of $S$ is a CFG that derives $S$ and only $S$. The size of a CFG
is the number of variables, i.e., $|V|$, and let $n = |V|$.

The parse tree of $G$ is a rooted ordered binary tree such that (i) internal
nodes are labeled by variables in $V$ and (ii) leaves are labeled by symbols
in $\Sigma$, i.e., the label sequence in leaves is equal to input string $S$.
In a parse tree, any internal node $Z$ corresponds to a production rule
$Z \to XY$ and has a left child with label $X$ and a right child with label
$Y$. A partial parse tree [18] is an ordered tree formed by traversing the
parse tree in a depth-first manner and pruning out all descendants under
every node of variables appearing no less than twice.

Straight-line program (SLP) [10] is defined as a grammar compression over
$\Sigma \cup V$ and its production rules are in the form of
$X_k \to X_iX_j$ where $X_k, X_i, X_j \in \Sigma \cup V$ and
$1 \le i, j < k \le n + \sigma$.
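
To make the definition of $val$ concrete, here is a minimal Python sketch that expands a variable of an SLP into the string it derives. The grammar `RULES` and the helper name `val` are hypothetical and only serve as a toy example; symbols without a rule are treated as terminals.

```python
def val(rules, symbol, memo=None):
    """Return the string derived from `symbol`; terminals derive themselves."""
    if memo is None:
        memo = {}
    if symbol not in rules:
        return symbol
    if symbol not in memo:
        left, right = rules[symbol]
        memo[symbol] = val(rules, left, memo) + val(rules, right, memo)
    return memo[symbol]

# Hypothetical SLP with n = 4 variables over the alphabet {a, b}.
RULES = {
    "X1": ("a", "b"),
    "X2": ("X1", "b"),
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
}

print(val(RULES, "X4"))   # prints: abbababb
```

Each rule refers only to symbols defined earlier, so the expansion always terminates.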

2.3 Phrase dictionary and reverse dictionary 

[Figure 1 panels: (I) parse tree for a post-order SLP; (II) post-order
partial parse tree (POPPT); (III) self-index structure, consisting of
  (i)   succinct representation of the POPPT (II): B: 0010101011
  (ii)  hash table for the reverse dictionary of the POSLP (the hash table
        is stored in main memory): ab → X1, X1b → X2, X2X1 → X3, X3X2 → X4
  (iii) wavelet tree indexing the symbol sequence at the leaves of the
        POPPT: L: a b b X1 X2
  (iv)  array named length array storing the lengths of strings derived by
        non-terminal symbols: R: 2 3 5 8]

Fig. 1. Example of parse tree, post order partial parse tree and self-index
structure. The self-index structure consists of four data structures which
are directly built from the parse tree.

A phrase dictionary is a data structure for directly accessing a digram
$X_iX_j$ from a given $X_k$ if $X_k \to X_iX_j \in D$. It is typically
implemented by an array requiring $2n\log(n+\sigma)$ bits for storing $n$
production rules. A reverse dictionary $D^{-1}$ is a mapping from a digram to
an associated variable. $D^{-1}(XY)$ returns the variable $Z$ if
$Z \to XY \in D$; otherwise, it creates a new variable $Z'$ and returns $Z'$.
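
The lookup-or-create behaviour of the reverse dictionary can be sketched as follows. This is a minimal illustration built on a Python `dict`, not the succinct structure used by the index. The class name `ReverseDictionary` and its method `lookup` are hypothetical, variables are represented by integer ids, and terminals by strings.

```python
class ReverseDictionary:
    """Maps a digram (X, Y) to its variable, creating a fresh variable on a miss."""

    def __init__(self):
        self.to_variable = {}   # digram -> variable id (the reverse dictionary)
        self.phrases = []       # phrases[k - 1] = digram of variable k (the phrase dictionary)

    def lookup(self, x, y):
        digram = (x, y)
        if digram not in self.to_variable:
            self.phrases.append(digram)
            self.to_variable[digram] = len(self.phrases)   # id of the new variable
        return self.to_variable[digram]

d = ReverseDictionary()
x1 = d.lookup("a", "b")      # creates variable 1
x2 = d.lookup(x1, "b")       # creates variable 2
again = d.lookup("a", "b")   # existing rule, returns variable 1 again
print(x1, x2, again)         # prints: 1 2 1
```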


2.4 Succinct data structures 

We use the fully indexable dictionary (FID) for indexing bit strings. Our
method represents CFGs using a rank/select dictionary, a succinct data
structure for a bit string $B$ supporting the following queries:
$rank_c(B, i)$ returns the number of occurrences of $c \in \{0,1\}$ in
$B[0, i]$; $select_c(B, i)$ returns the position of the $i$-th occurrence of
$c \in \{0,1\}$ in $B$; $access(B, i)$ returns the $i$-th bit in $B$. Data
structures with $|B| + o(|B|)$ bit storage that achieve $O(1)$ time rank and
select queries [17] have been presented.
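
The semantics of the three queries can be illustrated with a naive Python sketch. It runs in linear time and uses no succinct representation, so it only shows what the queries return, not how the cited structures achieve constant time. The function names and the 0-based positions are assumptions.

```python
def rank(bits, c, i):
    """Number of occurrences of c in bits[0..i] (inclusive)."""
    return bits[: i + 1].count(c)

def select(bits, c, i):
    """Position of the i-th occurrence (i >= 1) of c in bits, or None if there is none."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == c:
            seen += 1
            if seen == i:
                return pos
    return None

def access(bits, i):
    """The bit stored at position i."""
    return bits[i]

B = "0010101011"
print(rank(B, "1", 9), select(B, "1", 2), access(B, 2))   # prints: 5 4 1
```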

For online grammar compression, we adopt the dynamic range min/max tree
(DRMMT) [16] for online construction of the parse tree. We can obtain
$parent(B, i)$, the parent of node $i$ of DRMMT $B$, in
$O(\lg n/\lg\lg n)$ time, where $n$ is the number of nodes of the tree. We
consider the wavelet tree (WT) [5], an extension of FID for general
alphabets. A WT is a data structure for a string over a finite alphabet, and
it can compute the rank and select queries on a string $S \in \Sigma^*$ in
$O(\log\sigma)$ time using $|S|\log\sigma(1 + o(1))$ bits.

3 Edit Sensitive Parsing (ESP) and Fully-online LCA (FOLCA)

We review the ESP algorithm [4] and its online variant named FOLCA m in 
this section. The original ESP is an offline algorithm and builds a parse tree 
named ESP-tree from a given string. ESP-trees are complete, balanced binary 
trees each subtree of which is a 2-tree in the form of $X \to AB$ or a
2-2-tree in the form of $X \to AY$ and $Y \to BC$. The algorithm partitions
a string $S$ into non-overlapping substrings $S_1S_2 \cdots$, each of which
belongs to one of three substring types. Type1 is a substring of a repeated
symbol, i.e., $a^k$ for $a \in \Sigma$ and $k > 1$;





Fig. 2. Example of dynamic wavelet tree. L is the leaf label of POPPT; Code
is the integer representation of L; $B_i$ is the bit vector representing
elements in Code; only $B_i$ at each node is stored.


type2 is a substring longer than $2\lg^*|S|$ not including type1 substrings;
type3 is a substring that is neither type1 nor type2.

The parsing algorithm parses each substring $S_i$ according to the three
substring types. For type1 and type3 substrings, the algorithm performs the
left aligned parsing as follows. If $|S_i|$ is even, the algorithm builds a
2-tree from $S_i[2j-1, 2j]$ for each $j \in \{1, 2, \ldots, |S_i|/2\}$;
otherwise, the algorithm builds a 2-tree from $S_i[2j-1, 2j]$ for each
$j \in \{1, 2, \ldots, \lfloor(|S_i|-3)/2\rfloor\}$, and it builds a 2-2-tree
from the last trigram $S_i[|S_i|-2, |S_i|]$. For type2 substrings, the
algorithm further partitions substring $S_i$ into short substrings of length
two or three by using an efficient string partitioning procedure named
alphabet reduction [4], and it builds 2-trees for substrings of length two
and 2-2-trees for substrings of length three. The parsing algorithm generates
a new shorter string $S'$ of length from $|S|/3$ to $|S|/2$, and it parses
$S'$. The above process iterates on the new sequence until its length is one.
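
The left aligned parsing of a type1 or type3 substring can be sketched in Python as below. The function name `left_aligned_blocks` is hypothetical, the input is assumed to have length at least two, and the sketch only returns the blocks (digrams and one optional final trigram) from which 2-trees and the 2-2-tree would be built.

```python
def left_aligned_blocks(s):
    """Split s into digrams from the left; an odd-length s ends with one trigram."""
    n = len(s)
    if n < 2:
        raise ValueError("this sketch assumes a substring of length >= 2")
    end_of_pairs = n if n % 2 == 0 else n - 3
    blocks = [s[j:j + 2] for j in range(0, end_of_pairs, 2)]   # each becomes a 2-tree
    if n % 2 == 1:
        blocks.append(s[n - 3:])                               # becomes a 2-2-tree
    return blocks

print(left_aligned_blocks("aaaaaa"))    # prints: ['aa', 'aa', 'aa']
print(left_aligned_blocks("abcdefg"))   # prints: ['ab', 'cd', 'efg']
```

Every block has length two or three, so the number of blocks lies between one third and one half of the input length, which is where the length bounds on $S'$ come from.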

FOLCA is an online algorithm that builds the ESP-tree as a post-order partial
parse tree (POPPT) using the parsing rule in the ESP algorithm from a given
string in an online manner. A POPPT and the post-order SLP (POSLP)
corresponding to a POPPT are defined as follows.


Definition 1 (POPPT and POSLP [13]). A POPPT is a partial parse tree
whose internal nodes have post-order variables. A POSLP is an SLP whose
partial parse tree is a POPPT.


Figure 1-(I) and -(II) show an example of parse tree and POPPT.

Since FOLCA builds POPPT using the rules in the ESP algorithm, it can
exploit advantages existing in both SLP and ESP-tree. Given a string $S$,
FOLCA builds the POPPT of height $O(\lg|S|)$ in $O(|S|\lg^*|S|)$ time.
FOLCA's worst-case approximation ratio to the smallest CFG is
$O(\lg^*|S|)$. OESP-index directly encodes FOLCA's POPPT into a succinct
representation and builds an index structure in an online manner for fast
query searches and substring extractions, which is explained in the next
section.











































































4 Index structure of OESP-index 


OESP-index's succinct representation consists of four data structures: (i)
$B$: succinct tree of POPPT, (ii) $H$: hash table, (iii) $L$: non-negative
integer array indexed by wavelet tree, and (iv) $R$: non-negative integer
array.

Succinct tree of POPPT $B$ is a bit string made by traversing POPPT in
post-order, and putting '0' if a node is a leaf and '1' otherwise. The last
bit '1' in $B$ represents a virtual node and $B$ is indexed by the DRMMT
[16]. The succinct tree supports the following three operations:
$parent(B, i)$ returns the parent node of a node $i$; $left\_child(B, i)$
returns the left child of a node $i$; $right\_child(B, i)$ returns the right
child of a node $i$. They are computed in $O(\lg n/\lg\lg n)$ time. The space
for our succinct tree is at most $2n + o(n)$ bits.
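
The post-order encoding can be sketched as follows. The toy grammar `RULES` is hypothetical, chosen so that the output agrees with the bit string and length array printed in Fig. 1. The function name `encode_poppt` and the use of plain Python lists (instead of the DRMMT and the wavelet tree) are likewise assumptions made for illustration.

```python
def encode_poppt(rules, root):
    """Return (B, L, R): post-order bit string, leaf labels, and variable lengths."""
    bits, leaves, lengths = [], [], {}
    expanded = set()

    def visit(symbol):
        """Visit one node in post-order and return |val(symbol)|."""
        if symbol in rules and symbol not in expanded:
            expanded.add(symbol)              # first occurrence stays an internal node
            left, right = rules[symbol]
            lengths[symbol] = visit(left) + visit(right)
            bits.append("1")
        else:
            bits.append("0")                  # terminal or pruned repeat becomes a leaf
            leaves.append(symbol)
        return lengths.get(symbol, 1)

    visit(root)
    bits.append("1")                          # virtual node closing the bit string
    return "".join(bits), leaves, list(lengths.values())

RULES = {"X1": ("a", "b"), "X2": ("X1", "b"), "X3": ("X2", "X1"), "X4": ("X3", "X2")}
B, L, R = encode_poppt(RULES, "X4")
print(B)   # prints: 0010101011
print(L)   # prints: ['a', 'b', 'b', 'X1', 'X2']
print(R)   # prints: [2, 3, 5, 8]
```

Because variables are numbered in post-order, the $k$-th '1' in $B$ (ignoring the virtual node) is the internal node of variable $X_k$, and the lengths are produced in the same order.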

A reverse dictionary $H : (V \cup \Sigma) \times (V \cup \Sigma) \to V$ is
implemented by a chaining hash table. Let $\alpha$ be a constant called a
load factor. The hash table has $\alpha n$ entries and each entry stores a
list of integers $i$ representing the left hand side of a rule
$X_i \to X_jX_k$. The size of the data structure is
$\alpha n\lg(n+\sigma)$ bits for the hash table and $n\lg(n+\sigma)$ bits
for the lists. Thus, the total size is $n(1+\alpha)\lg(n+\sigma)$ bits. The
access time is expected $O(1/\alpha)$ time.
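
A chaining hash table of this kind can be sketched in Python as below. The sketch assumes a fixed number of buckets chosen up front (it never resizes), stores only variable ids in the chains, and recovers each digram from a separate phrase array. The class name `ChainedReverseDict` and its fields are hypothetical.

```python
class ChainedReverseDict:
    """Chaining hash table whose chains store only the ids i of rules X_i -> X_j X_k."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]
        self.phrase = [None]               # phrase[i] = (X_j, X_k); index 0 is unused

    def lookup_or_add(self, left, right):
        chain = self.buckets[hash((left, right)) % len(self.buckets)]
        for i in chain:                    # expected chain length is n / num_buckets
            if self.phrase[i] == (left, right):
                return i
        self.phrase.append((left, right))  # miss: create a new variable
        i = len(self.phrase) - 1
        chain.append(i)
        return i

h = ChainedReverseDict(num_buckets=4)
x1 = h.lookup_or_add("a", "b")
x2 = h.lookup_or_add(x1, "b")
print(x1, x2, h.lookup_or_add("a", "b"))   # prints: 1 2 1
```

With $\alpha n$ buckets for $n$ rules, a chain holds $1/\alpha$ ids on average, which matches the expected access time stated above.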

Non-negative integer array $L$ stores symbols at leaves from the leftmost
leaf to the rightmost leaf in the POPPT. $L$ is indexed by dynamic wavelet
tree (DWT) that is presented in the next subsection. Each element of
non-negative integer array $R$ is the length of the string derived from a
variable, i.e., $|val(X_i)|$ for $X_i \in V$. The size of $R$ is $n\lg|S|$
bits. Figure 1-(III)-(iv) shows an example of data structures.

4.1 Dynamic wavelet tree (DWT)

Our DWT is a wavelet tree supporting an operation of adding an element to
the tail of a sequence. Such an operation is called a pushback, and it is
necessary for implementing DWT. A wavelet tree for sequence $L$ over range of
alphabet and variables $[1..(n+\sigma)]$ can be recursively described over
sub-range $[a..b] \subseteq [1..(n+\sigma)]$. A wavelet tree over range
$[a..b]$ is a binary balanced tree with $b-a+1$ leaves. If $a = b$, the tree
is just a leaf labeled $a$. Otherwise it has an internal root node that
represents $L$. The root has a bitstring $A_{root}[1, |L|]$ defined as
follows: if $L[i] \le (a+b)/2$ then $A_{root}[i] = 0$, else
$A_{root}[i] = 1$. We define $L_0[1, \ell_0]$ as the subsequence of $L$
formed by the symbols $c \le (a+b)/2$, and $L_1[1, \ell_1]$ as the
subsequence of $L$ formed by the symbols $c > (a+b)/2$. Then, the left child
of the root is a wavelet tree for $L_0[1, \ell_0]$ over range
$[a..\lfloor(a+b)/2\rfloor]$ and the right child of the root is a wavelet
tree for $L_1[1, \ell_1]$ over range $[1+\lfloor(a+b)/2\rfloor..b]$.

Implementing WTs without pointers uses a small space of
$n\lg(n+\sigma) + o(n\lg(n+\sigma))$ bits, but supporting the pushback
operation is difficult. Thus, we implement DWTs using pointers where the
binary tree is explicitly represented. When a new symbol exceeding the
representation ability of the current binary tree in DWT is added to DWT,
DWT adds new nodes to the binary tree, which increases the height of the
tree. The DWT uses $(3n+2\sigma)\lg(n+\sigma) + o(n\lg(n+\sigma))$ bits of
space. Figure 2 shows an example of DWT.
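
The pointer-based design with pushback can be sketched in Python as follows. This is an illustration under several assumptions: symbols are non-negative integers starting at 0, each node keeps its bits in a plain Python list (so rank inside a node takes linear time instead of using a dynamic rank/select bit vector), positions are 0-based, and all class and method names are hypothetical.

```python
class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi          # symbol range covered by this node
        self.bits = []                     # one bit per element routed through the node
        self.left = self.right = None

class DynamicWaveletTree:
    def __init__(self):
        self.root, self.size = Node(0, 1), 0

    def pushback(self, c):
        while c > self.root.hi:            # symbol does not fit: add a level on top
            new_root = Node(0, 2 * self.root.hi + 1)
            new_root.bits = [0] * self.size
            new_root.left = self.root      # all old elements lie in the lower half
            self.root = new_root
        node = self.root
        while node.lo < node.hi:
            mid = (node.lo + node.hi) // 2
            if c <= mid:
                node.bits.append(0)
                node.left = node.left or Node(node.lo, mid)
                node = node.left
            else:
                node.bits.append(1)
                node.right = node.right or Node(mid + 1, node.hi)
                node = node.right
        self.size += 1

    def access(self, i):
        node = self.root
        while node.lo < node.hi:
            b = node.bits[i]
            i = node.bits[:i].count(b)     # rank of b strictly before position i
            node = node.right if b else node.left
        return node.lo

    def select(self, c, k):
        """0-based position of the k-th (k >= 1) occurrence of c, or None."""
        path, node = [], self.root
        if c > node.hi:
            return None
        while node.lo < node.hi:
            b = 0 if c <= (node.lo + node.hi) // 2 else 1
            path.append((node, b))
            node = node.right if b else node.left
            if node is None:
                return None
        pos = k - 1
        for node, b in reversed(path):     # map the position back up to the root
            hits = [idx for idx, x in enumerate(node.bits) if x == b]
            if pos >= len(hits):
                return None
            pos = hits[pos]
        return pos

wt = DynamicWaveletTree()
for symbol in [1, 2, 2, 5, 6]:
    wt.pushback(symbol)
print([wt.access(i) for i in range(5)], wt.select(2, 2), wt.select(5, 1))
# prints: [1, 2, 2, 5, 6] 2 3
```

Adding a new root above the old one is what increases the height of the tree when a symbol outside the current range arrives.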


Algorithm 1 NextCore on implicit parse tree POPPT(B, L)

v: the leftmost occurrence node of maximal core, p: empty stack
function NextCore(v, p)
    if v ≠ root then
        if v is the left child of parent(B, v) then
            p.push(left)
        else
            p.push(right)
        end if
        NextCore(parent(B, v), p)
        i ← 1
        p.pop()
        while (u ← select_v(L, i)) ≠ NULL do    ▷ (u, p): the next occurrence on explicit tree
            if u is the left child of parent(B, u) then
                p.push(left)
            else
                p.push(right)
            end if
            NextCore(parent(B, u), p)
            p.pop()
            i ← i + 1
        end while
    end if
end function


4.2 Complexity for building OESP-index 

Theorem 1. The size of OESP-index is
$n\lg|S| + O((n+\sigma)\lg(n+\sigma))$ bits. The construction time is
$O(\frac{1}{\alpha}|S|\lg(n+\sigma)\lg^*|S|)$ and the memory consumption is
the same as the index size, where $S$ is an input string, $n$ is the number
of variables, $\alpha$ is a load factor of the hash table, and we assume the
size of alphabet is constant. The update time for the next input symbol is
$O(\frac{1}{\alpha}\lg(n+\sigma)\lg^*|S|)$.

Proof. The size of the length array $R$ is $n\lg|S|$ bits for $n$ variables.
The size of $B$ is $2n + o(n)$ bits and the size of $L$ and $H$ are
$O((n+\sigma)\lg(n+\sigma))$ bits each. We can access $Z = H(XY)$ in
$O(1/\alpha)$ time for a load factor $\alpha \in (0,1]$. The alphabet
reduction is iterated at most $\lg^*|S|$ times for each symbol. The time to
get the parent and left/right children of a node in the partial parse tree is
$O(\lg(n+\sigma))$ using the rank/select over the DWT for $L$. Thus, the
construction time of the parse tree is
$O(\frac{1}{\alpha}|S|\lg(n+\sigma)\lg^*|S|)$. The bound on the update time
follows analogously.

4.3 Query search and substring extraction 

For a node $v$ of a parse tree of the string $S \in \Sigma^*$, $yield(v)$
denotes the substring of $S$ derived from $v$, and
$yield(v_1 \cdots v_k) = yield(v_1) \cdots yield(v_k)$. $Label(v)$ denotes
the label of $v$ and $Label(v_1 \cdots v_k) = Label(v_1) \cdots Label(v_k)$.
If $Label(v) = X$, $yield(X)$ is identical to $yield(v)$. $lca(u, v)$ is the
lowest common ancestor of $u, v$. For a pattern $P \in \Sigma^*$, nodes
$\{v_1, \ldots, v_k\}$ such that $yield(v_1 \cdots v_k) = P$ are called
embedding nodes of $P$. For embedding nodes $\{v_1, \ldots, v_k\}$, string
$Q = Label(v_1 \cdots v_k)$ is called an evidence of pattern $P$. Since the
trivial evidence $Q$ identical to $P$ always exists, the notion of evidence
is well-defined. In addition, for embedding nodes $\{v_1, \ldots, v_k\}$, a
node $z$ such that $z = lca(v_1, v_k)$ is called an occurrence node of $P$.







The next theorem tells that we can find shorter evidence depending on |P|. 

Theorem 2. ( JTJ^) There exists an evidence $Q = Q_1 \cdots Q_t$ of $P$ such
that each $Q_i$ is a maximal repetition or a symbol and
$t = O(\lg|P|\lg^*|S|)$.

The time to find the evidence Q of pattern P is bounded by the construction 
time of the parsing tree of P. In our data structure of OESP, the time to find 
the evidence Q is estimated as follows. 

Theorem 3. The time to find $Q$ is
$O(\frac{1}{\alpha}|P|\lg(n+\sigma)\lg^*|S|)$.

Proof. This bound is clear by Theorem 1.

Let us consider the simple case that $|Q_i| = 1$ for any $i$. In this case,
$Q$ contains no repetition such that
$Q = q_1 \cdots q_t \in (\Sigma \cup V)^*$. A symbol $q_k$ is called a
maximal core if $|yield(q_k)| \ge |yield(q_i)|$ for any $i$. For an internal
node $v$ of the parse tree $T$ of $S$ with $Label(v) = q_k$, an ancestor $z$
of $v$ is the occurrence node of $P$ iff all $q_1, \ldots, q_{k-1}$ and
$q_{k+1}, \ldots, q_t$ can be embedded around $v$. Moreover, any occurrence
node of $P$ is restricted by the case $Label(v) = q_k$. For the general case
$Q_i = a^\ell$ ($\ell > 1$), i.e., $Q_i$ is a repetition, we can reduce the
embedding of $a^\ell$ to the embedding of a string $AB \cdots C$ of length at
most $O(\lg\ell)$ such that $yield(AB \cdots C) = a^\ell$. Thus, the
embedding of a type1 string is easier than others, and then, without loss of
generality, we can assume $|Q_i| = 1$ for any $i$.

The remaining task of the search problem is the random access to all
occurrences of the maximal core $q_k$ over the POPPT, the pruned parse tree.
By the definition of POPPT, the internal node with rank $k$ is the leftmost
occurrence of the symbol $q_k$ itself. In the previous indexes [nns], a next
occurrence of $q_k$ is obtained using a data structure based on the renaming
of variables in a lexicographic order. This data structure is not, however,
dynamically constructable. Therefore, we develop the search algorithm
NextCore (Algorithm 1) for the OESP-index.

The NextCore visits all occurrences of the maximal core on the parse tree 
T using its implicit POPPT T'. When NextCore receives a candidate node v 
containing a maximal core q as its descendant, it computes the pair (u,p) where 
u is the next occurrence of v in T' and p is the path from u to q. Thus, (u,p) 
indicates the occurrence of q in the explicit parse tree. We show the correctness 
of this algorithm and its complexity. 
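
The traversal can be sketched in Python on an explicit pointer version of the POPPT. This is an illustration, not the succinct implementation: nodes are ordinary objects with parent pointers instead of positions in $B$, the leaves labeled by a variable are kept in a dictionary instead of being found by select on $L$, and the toy grammar `RULES` and all function names are hypothetical.

```python
class Node:
    def __init__(self, label, parent, side):
        self.label, self.parent, self.side = label, parent, side

def build_poppt(rules, root_symbol):
    """Build the POPPT; return the internal node of each variable and the leaves per label."""
    internal, leaves = {}, {}

    def visit(symbol, parent, side):
        node = Node(symbol, parent, side)
        if symbol in rules and symbol not in internal:
            internal[symbol] = node                    # leftmost occurrence is kept
            visit(rules[symbol][0], node, "left")
            visit(rules[symbol][1], node, "right")
        else:
            leaves.setdefault(symbol, []).append(node) # terminal or pruned repeat
        return node

    visit(root_symbol, None, None)
    return internal, leaves

def core_occurrences(core, internal, leaves):
    """Return the root-to-core paths of all occurrences of `core` in the full parse tree."""
    found = []

    def next_core(v, path):
        if v.parent is None:                           # reached the root: one occurrence
            found.append(tuple(reversed(path)))
            return
        for u in [v] + leaves.get(v.label, []):        # v itself, then each pruned copy of v
            path.append(u.side)
            next_core(u.parent, path)
            path.pop()

    next_core(internal[core], [])
    return found

RULES = {"X1": ("a", "b"), "X2": ("X1", "b"), "X3": ("X2", "X1"), "X4": ("X3", "X2")}
internal, leaves = build_poppt(RULES, "X4")
for path in core_occurrences("X1", internal, leaves):
    print(path)
```

For this grammar the variable X1 occurs three times in the full parse tree, and the sketch reports one path per occurrence.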

Lemma 1. NextCore finds any occurrence of the maximal core exactly
once. The amortized time to find a next occurrence is ^ )■ 

Proof. Let $T$ be the parse tree and $T'$ be the POPPT $(B, L)$. By the
definition of $T'$, any internal node $x$ of $B$ is the variable itself,
i.e., $Label(x) = x$. For the maximal core $q$, let
$v_1 < v_2 < \cdots < v_k$ be the post-order of its occurrences in $T$. We
show that the algorithm finds any $v_i$ as $(u, p)$ by induction on $i$.
Given $q$, the internal node $q$ of $B$ represents the leftmost occurrence of
$q$ itself. Then, for the base case $i = 1$, the occurrence $v_1$ is obtained
as $(q, p)$ with $|p| = 0$. Assume the induction hypothesis on some $i$.
Since the node $v_{i+1}$ was pruned in $T'$, let $u$ be the leaf of $T'$
corresponding to the root of the pruned maximal subtree containing
$v_{i+1}$. For $Label(u) = u'$, there is the leftmost occurrence of $u'$ as
an internal node of $B$. The subtree on the node $u'$ contains an occurrence
of $q$ because the two subtrees on $u$ and $u'$ in $T$ are identical to each
other. Let $p$ be the path from $u'$ to $v'$ for some
$v' \in \{v_1, \ldots, v_i\}$. By the induction hypothesis, the algorithm
finds $v'$ as $(u', p)$. Then, $v_{i+1}$ can be also found as $(u, p)$. On
the other hand, any $(u, p)$ is unique, so the algorithm finds any
occurrence of $q$ exactly once. For the time complexity, the number of
executed select operations is bounded by the number of different $(u, p)$,
that is, $O(occ_q\log|S|)$, where $occ_q$ is the number of occurrences of
$q$. Each select operation on $L$ and parent operation on $B$ take
$O(\lg(n+\sigma))$ and $O(\lg n/\lg\lg n)$ time, respectively. Therefore,
the total time

is 0{ocCq{\g{n + cr) + *^ig^ )) = )) and the amortized time to 

find a next occurrence of q is Q( ^^igigjf ^ )■ 

Theorem 4. The counting/locating time of pattern and extraction time are
$O(\lg(n+\sigma)(\frac{|P|}{\alpha} + occ_q(\lg|S| + \lg|P|\lg^*|S|)))$ and
$O(\lg(n+\sigma)(|P| + \lg|S|))$, respectively, where $P$ is a query pattern
and $occ_q$ is the number of occurrences of the maximal core of $P$ in the
parse tree.

Proof. Since we can get the length of the substring encoded by any variable
in $O(1)$ time, the locating time is the same as the counting time. Given the
pattern $P$, as previously shown, the evidence $Q$ of $P$ is found in
$O(\frac{1}{\alpha}|P|\lg(n+\sigma)\lg^*|S|)$ time. For each occurrence of a
maximal core, we can check if the sequence of symbols of length
$O(\lg|P|\lg^*|S|)$ is embedded around the core in
$O(\lg(n+\sigma)(\lg|S| + \lg|P|\lg^*|S|))$ time. Therefore, by Lemma 1, the
total counting time of pattern is

O (—lg(n + (T)lg*|S'| +occ,lg(n + (T)(lg|S'| + Ig |P| lg*|S'|) + ) 

a ig ig n 

= 0(lg(n + cr)(— + occqilg l^l + Ig |P| lg*|5'|))). 
a 

On the other hand, for any $S[i, j]$ of length $m$, we can find $S[i]$ in
$O(\lg|S|)$ time and visit all leaves in $S[i, j]$ in $O(|P|)$ time because
the parse tree is balanced. The extraction time follows from this.
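
The role of the length array in substring extraction can be sketched in Python as follows. The sketch descends only into subtrees that overlap the requested range, using the stored lengths to skip the rest. The grammar `RULES`, the dictionary `LENGTH` (standing in for the array $R$), the 0-based inclusive positions, and the function name `extract` are assumptions made for illustration.

```python
def extract(rules, length, root, i, j):
    """Return S[i..j] (0-based, inclusive) without expanding the whole text."""
    out = []

    def walk(symbol, start):
        """`start` is the position in S of the first character derived from `symbol`."""
        size = length.get(symbol, 1)          # terminals derive one character
        if start > j or start + size - 1 < i:
            return                            # subtree lies entirely outside [i, j]
        if symbol not in rules:
            out.append(symbol)
            return
        left, right = rules[symbol]
        walk(left, start)
        walk(right, start + length.get(left, 1))

    walk(root, 0)
    return "".join(out)

RULES = {"X1": ("a", "b"), "X2": ("X1", "b"), "X3": ("X2", "X1"), "X4": ("X3", "X2")}
LENGTH = {"X1": 2, "X2": 3, "X3": 5, "X4": 8}

print(extract(RULES, LENGTH, "X4", 2, 5))   # prints: baba
```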


5 Experiments 

We evaluated the actual performance of OESP-index on real data^1. The
environment is an Intel(R) Core(TM) i7-2620M CPU (2.7 GHz) machine with 16 GB
of memory. We use einstein.en.txt (einstein, 446 MB) and cere (cere, 440 MB),
where einstein is highly repetitive.


^1 http://pizzachili.dcc.uchile.cl/repcorpus/real/








Fig. 4. Working space of dictionary D, length array R, and hash table H for
einstein (left) and cere (right).

Fig. 5. Construction time of each method for einstein (left) and cere (right).
















Table 2. Index size in megabytes (MB).

            OESP-index  ESP-index  SLP-index  LZ-index  FM-index
einstein         22.84       1.76       2.28    177.02    942.85
cere            364.92      27.40      45.74    438.05    806.52


Table 3. Working memory of dictionary D consisting of the bit string B and
the dynamic wavelet tree L for einstein and cere.

Size of text (MB)       50     100     150     200     250     300     350     400     all
einstein  B (MB)      0.04    0.07    0.07    0.07    0.14    0.14    0.14    0.14    0.14
          L (MB)      5.06    6.63    7.84    8.99   10.88   12.23   13.38   14.37   15.14
cere      B (MB)      1.10    1.10    1.10    2.20    2.20    2.20    2.20    2.20    2.20
          L (MB)    102.34  131.58  164.05  179.90  199.08  216.87  225.70  235.62  245.81


Compared self-indexes are the offline version of ESP-index (ESP-index) [19],
another grammar-based self-index (SLP-index) [1,2], LZ-based index
(LZ-index)^2, and BWT-based self-index (FM-index)^3. Figure 3 shows the
required working memory (MB) in response to an increase of input string. For
the offline algorithms, the working memory is evaluated for each static data
with the indicated size. Figure 4 is the breakdown of required memory by the
data structures of OESP-index: dictionary $D$, length array $R$, and hash
table $H$. Besides, Table 3 is the breakdown of $D$ by the bit string $B$ and
the wavelet tree $L$.

Table 2 shows the size of indexes of all methods. The size of OESP-index
is smaller than LZ-index and FM-index but larger than ESP-index and
SLP-index. The increase of index size arises from DWT. Reducing this data
size is important future work.

The memory consumption of OESP-index is smallest for both types of data.
The required memory of OESP-index is 2.5% (einstein) and 40% (cere) of offline
ESP-index. The space efficiency of OESP-index comes down when the data is
not large and not highly repetitive (Figure 3 (right)). Especially, $L$
represented by the dynamic wavelet tree (DWT) consumes a large space
(Table 3) arising from the pointer and the reservation space of bit string in
DWT.

Figure 5 shows the construction time. OESP-index is slowest for both data
in all methods. OESP-index is 57.1 times (einstein) and 58.1 times (cere)
slower than ESP-index because the original one can use GMR [7], a faster
wavelet tree algorithm that is not available in the online version.

Figure 6 shows the search time. Here the search time means the locating time
since the counting time is almost the same as the locating time. We note that
the result of SLP-index is not shown because it could not work for this data.
The range of the length of query pattern is [10, 1000]. The locating time of
OESP-index is slowest in both data in all query length. OESP-index is 163.2
times (cere) and 24.9 times (einstein) slower than ESP-index.

^2 http://pizzachili.dcc.uchile.cl/indexes/LZ-index/LZ-indexl
^3 https://code.google.com/p/fmindex-plus-plus/

6 Conclusion 

We have presented OESP-index, an online self-indexed grammar compression. 
OESP-index is the first method for building grammar-based self-indexes in an 
online manner. Experimental results demonstrated OESP-index’s potential for 
processing a large collection of highly repetitive texts. Future work is to make 
OESP-index scalable to massive collections of the same type, which is required 
in the big data era. 